AITopics | visual region and textual concept

Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

Neural Information Processing Systems, Dec-25-2025, 19:32:45 GMT

In vision-and-language grounding problems, fine-grained representations of the image are considered to be of paramount importance. Most of the current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components, the relations of which are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on two representative vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models under all metrics across the board. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications.

aligning visual region, semantic-grounded image representation, visual region and textual concept, (3 more...)

Technology: Information Technology > Artificial Intelligence > Vision (0.83)

Reviews: Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

Neural Information Processing Systems, Jan-26-2025, 03:02:31 GMT

This paper describes a method for integrating visual and textual features within a self-attention-like architecture. Overall I find this to be a good paper presenting an interesting method, with comprehensive experiments demonstrating the capacity of the method to improve on a wide range of models in image captioning as well as VQA. The analysis is informative, and the supplementary materials add further comprehensiveness. My main complaint is that the paper could be clearer about the current state of the art in these tasks and how the paper's contribution relates to that state of the art. The paper apparently presents a new state-of-the-art on the COCO image captioning dataset, by integrating the proposed method with the Transformer model. It doesn't, however, report what happens if the method is integrated with the prior state-of-the-art model SGAE -- was this tried and shown not to yield improvement?

aligning visual region, semantic-grounded image representation, visual region and textual concept, (1 more...)

Technology:

Information Technology > Artificial Intelligence (0.61)
Information Technology > Sensing and Signal Processing > Image Processing (0.40)

Reviews: Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

Neural Information Processing Systems, Jan-26-2025, 03:02:20 GMT

The paper proposes a new method called Mutual Iterative Attention (MIA) for improving the representations used by common visual-question-answering and image captioning models. MIA works by repeated execution of 'mutual attention', a computation that is similar to the self-attention operation in the Transformer model, but where the lookup ('query') representation is conditioned by information from the other modality. Importantly, the two modalities involved in the MIA operation are not vision and language, they are vision and 'textual concepts' (which they also call 'textual words' and 'visual words' at various points in the paper). These are actual words referring to objects that can be found in the image. The model that predicts textual concepts (the 'visual words' extractor) is trained on the MS-COCO dataset in a separate optimization to the captioning model. Applying MIA to a range of models before attempting VQA or captioning tasks improves the scores, in some cases above the state-of-the-art. It is a strength of this paper that the authors apply their method to a wide range of existing models and observe consistent improvements.
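The 'mutual attention' idea described in this review can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain single-head scaled dot-product attention with no learned projections, and it reads "conditioned by the other modality" as simply taking the queries from the other modality. The function names `attend` and `mutual_iterative_attention`, the iteration count, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, source):
    """Scaled dot-product attention without learned weights.

    Each query vector is replaced by a weighted average of the
    source vectors, weighted by query-source similarity.
    """
    dim = len(source[0])
    output = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, row)) / math.sqrt(dim)
            for row in source
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * row[j] for w, row in zip(weights, source)) for j in range(dim)]
        )
    return output


def mutual_iterative_attention(regions, concepts, num_iters=2):
    """Hypothetical sketch: alternate cross-modal lookups a fixed number of times."""
    v, t = regions, concepts
    for _ in range(num_iters):
        v = attend(t, v)  # textual concepts query the visual features
        t = attend(v, t)  # the refined visual features query the textual concepts
    return v, t


# Toy inputs (assumed): three region vectors and two concept vectors, dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
concepts = [[1.0, 0.0], [0.0, 1.0]]
v, t = mutual_iterative_attention(regions, concepts)
print(len(v), len(t))
```

In this sketch both outputs end up with one row per textual concept, because the first lookup in each round uses the concepts as queries. A real model would add learned projections, multiple heads, and normalization, none of which are shown here.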

semantic-grounded image representation, textual concept, visual region and textual concept, (11 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.57)
Information Technology > Sensing and Signal Processing > Image Processing (0.40)
Information Technology > Artificial Intelligence > Vision (0.40)

Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

Liu, Fenglin, Liu, Yuanxin, Ren, Xuancheng, He, Xiaodong, Sun, Xu

Neural Information Processing Systems, Mar-18-2020, 23:17:31 GMT

In vision-and-language grounding problems, fine-grained representations of the image are considered to be of paramount importance. Most of the current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components, the relations of which are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities.

aligning visual region, semantic-grounded image representation, visual region and textual concept, (1 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (0.68)
Information Technology > Sensing and Signal Processing > Image Processing (0.45)
